Techniques for Automated Taxonomy Building: Towards Ontologies for Knowledge Management
Abstract
Ontologies have become widely accepted as the main method for representing knowledge in Knowledge Management (KM) applications. Given the continuous, rapid change and dynamic nature of knowledge in all fields, automated methods for constructing ontologies are of great importance. All ontologies or taxonomies currently in use have been hand built and require considerable manpower to keep up to date. Taxonomies are less logically rigorous than ontologies, and in this paper we consider the requirements for a system which automatically constructs taxonomies. There are a number of potentially useful methods for constructing hierarchically organised concepts from a collection of texts, and there are a number of automatic methods which permit one to associate one word with another. The important issue for the successful development of this research area is to identify techniques for labelling the relation between two candidate terms, if one exists. We consider a number of possible approaches and argue that the majority are unsuitable for our requirements.

1. The Need for Ontologies and Taxonomies

Artificial intelligence has for decades confronted the issue of ‘knowledge acquisition’ and struggled to move from representations of toy worlds to larger, more realistic representations of human knowledge. One current form of this struggle has crystallised in the commercial needs of Knowledge Management (KM). In the modern ‘knowledge-based’ economy, a company’s value depends increasingly on “intangible assets” which exist in the minds of employees, in databases, in files and in a myriad documents. Knowledge management technologies capture this intangible element in an organisation and make it universally available. The most widely used method of mapping the knowledge of a domain is to use an ontology describing that domain.
Ontologies can act as an index to the memory of an organisation and facilitate semantic searches and the retrieval of knowledge from the corporate memory as it is embodied in documents and other archives. Repeated research has shown their usefulness [include the Mädche refs], especially for specific domains (Järvelin & Kekäläinen 2000). For example, in order to successfully manage a complex knowledge network of experts, the Minneapolis company Teltech has developed an ontology of over 30,000 terms describing its domain of expertise (Davenport 1998). There are many real-world examples where the utility of ontologies as maps or models of specific domains has been repeatedly proven (Fensel et al. 2001). The use of ontologies in KM and in the Semantic Web has had particular momentum, especially in the latter, where they are central to Tim Berners-Lee’s vision of the Semantic Web (Berners-Lee et al. 2001) and the ability of agents to perform “sophisticated tasks for users”. Whatever the discipline, however, existing work on the construction of ontologies has concentrated on the formal properties and characteristics that an ontology should have in order to be useful (Gomez-Perez 1999, Guarino REFS) rather than on the practical aspects of constructing one, i.e. on reducing the enormous manual effort involved.

The rest of this paper is organised as follows. In Part 2, we discuss some of the problems associated with knowledge acquisition as seen through the prism of building ontologies and taxonomies. In Part 3, we consider a number of methodological criteria which arise from the context of trying to build taxonomies for knowledge management, and how they affect our choice of algorithms and methods. In Parts 4 and 5, we briefly discuss some of the major techniques in the literature concerning building hierarchies and identifying relations between terms in texts, respectively.

2. The Problem with Knowledge Acquisition

Knowledge, it is widely assumed, can be codified in an ontology.
An ontology has been defined by Gruber (1993) as a “formal explicit specification of a shared conceptualisation” and this has been widely cited with approval (Fensel et al. 2001). Berners-Lee says “an ontology is a document or file that formally defines the relations among terms. The most typical kind of ontology for the Web has a taxonomy and a set of inference rules” (Berners-Lee et al. 2001). We see ontologies as lying on a continuum reflecting the degree of logical rigour applied in their construction. At one extreme lie ontologies which purport to be entirely explicit in the sense that logical inferences can, in principle, be easily calculated over these structures. At the other extreme we could place pathfinder networks (Schvaneveldt 1990) or even ‘mindmaps’ (Buzan 1993), which require considerable human interpretation before they can be said to represent ‘knowledge’ of any form. Somewhere in between lie taxonomies and browsable hierarchies, which are clearly less rigorous than a fully specified ontology.

Our interest in this paper lies in the construction of taxonomies and browsable hierarchies because we believe that it is more feasible to construct these automatically or semi-automatically than fully-fledged ontologies. Gomez-Perez (1999), for example, presents very strict criteria for ontology construction concerning consistency, completeness and conciseness, which may be achievable in a specific subdomain (she discusses the ‘Standard Units’ ontology) but can only be idealised objectives when dealing with wider knowledge areas. This is entirely parallel with the art of lexicography, which aspires to exactly the same ideals, but which any experienced lexicographer knows are just that: ‘ideals’.

One of the major problems in this field is that authors working with ontologies commonly assume that ordinary users will be willing to contribute to the building of a formal ontology.
Thus, for example, Stutt and Motta present an imaginary scenario where an archaeologist marks up his text with ‘various’ ontologies and furthermore, not finding the Problem Solving Methods (PSMs) associated with the ontologies adequate, adds to the set of existing PSMs (Stutt and Motta 2000:218). This is entirely unrealistic because there is no motivation for archaeologists to burden themselves with this kind of extra task. Similar conclusions have been drawn in industry. It was assumed that, given the existence of a taxonomy or ontology, authors would be willing to tag their own work in an appropriate manner, but the experience of both librarians historically and, more recently, companies like ICL and Montgomery Watson is that authors tag their own work inadequately or inappropriately (Gilchrist and Kibby 2000).

Currently ontologies and taxonomies are all hand-built. Whether we consider the general browsing hierarchies of Yahoo or Northern Light at one extreme or the narrow scientific ontology developed by the partners of the Gene Ontology project (http://www.geneontology.org/) at the other, these data structures are built by manual labour. Yahoo is reputed to employ over one hundred people to keep its taxonomy up to date (Dom 1999). Although considerable use is made of taxonomies in industry, it is clear from a number of sources that they are all the result of manual effort both in construction and maintenance. Consider, for example, this extract from a recent job advertisement on Dice.com:

Duties: The Ontology Manager will hold a key role in meeting customer demands by maintaining the master ontology that organized content for the eBusiness initiatives. This individual will ensure the data is organized to facilitate rapid product selection ....

A typical example is that of Arthur Andersen, who have recently constructed a company-wide taxonomy entirely by hand.
Their view of the matter is that there is no alternative because the categories used come from the nature of the business rather than the content of the documents. This is paralleled by the attitude of the British Council’s information department, who take the view that the optimum balance between human and computer in this area is 85:15 in favour of humans. Not all companies perceive human input as so sacrosanct; Braun GmbH, for example, would appreciate a tool for taxonomy creation and automatic keyword identification (Gilchrist and Kibby 2000:34). One of the earliest exponents of knowledge management, PricewaterhouseCoopers, considers that “the computer can enable activity, but knowledge management is fundamentally to do with people” (ibid.:118).

One way in which certain companies reduce the manual effort involved is by using ready-made taxonomies provided by others. An example of this is Braun GmbH, whose internal taxonomy is based on the (hand-built) resources provided by FIZ Technik (a technical thesaurus) and MeSH (the Medical Subject Headings provided by the US National Library of Medicine). Nonetheless, about 20% of the vocabulary in the taxonomy is generated internally by the company. Another example is the case of GlaxoWellcome (now GSK), who have integrated three legacy taxonomies derived from company mergers using a tool called Thesaurus Manager, developed by the company in collaboration with Cycorp, Inc.

There are major problems with the construction and maintenance of ontologies and taxonomies. First, there is the high initial cost in terms of human labour in performing the editorial task of writing the taxonomy and maintaining it. In fact, this consists of two tasks. One is the construction of the actual taxonomy and the other is associating specific content with a particular node in the taxonomy.
For example, in Yahoo or the Open Directory (www.dmoz.org), there is the actual hierarchy of categories and then there are specific web sites which are associated with a particular category. Secondly, the knowledge which the taxonomy attempts to capture is in constant flux: it is changing and developing continuously. This means that if the taxonomy is built by hand, like a dictionary, it is out of date on the day of publication. Thirdly, taxonomies need to be very domain specific. Particular subject areas, whether in the academic or business world, have their own vocabulary and technical terminology, thus making a general ontology or taxonomy inappropriate without considerable pruning and editing. Fourthly, taxonomies reflect a particular perspective on the world, the perspective of the individuals or organisation which builds them. For example, a consulting firm has in its internal taxonomy the category ‘business opportunity’, but what artefacts fall within this category is a function of both the nature of the business and the insights of the consultants themselves. Fifthly, and this is an extension of the previous issue, the categories in a taxonomy are often human constructs, abstractions reflecting a particular understanding. Thus a category like ‘business opportunity’ or even ‘nouns’ is an abstraction derived from an analytical framework and not inherent in the data itself. Finally, the fact remains that while an ontology is supposed to be a “shared conceptualisation”, it is often very difficult for human beings to agree on a particular manner of categorising the world.

Given these problems there are two possible conclusions. The first three points indicate the need for maximally automated systems which reduce the manual labour involved and make it feasible to keep a taxonomy up to date. The last three points would seem to indicate that the task is not feasible or at best irrelevant.
However, we have argued elsewhere for a model of ontology construction involving the judicious integration of automated methods with manual validation (Brewster et al. 2000), and this we believe is the direction to take.

2.1 Protocols, Introspection and Textual Data

There are two traditional methods in KM for the acquisition of knowledge, whether it is used to construct an ontology or some other form of knowledge base. The first is protocol analysis (Ericsson & Simon 1984), which involves structured interviews with experts in a particular domain, asking them to describe their thought processes as they work and the knowledge they use to make decisions or arrive at conclusions. The other is human introspection, which is widely used, for example, in the construction of a large number of the ontologies available at the Stanford Ontology Server.

A parallel can be drawn with linguistics and lexicography. Traditionally in linguistics two approaches were used to write a dictionary. One, characteristic of field linguists and used when the language was obscure or entirely unknown, involved elicitation, i.e. interviews with native informants. This is parallel to a protocol analysis approach. The other, characteristic of lexicographers and used for dictionaries of well-known languages, involved using everyone else’s previous dictionaries and one’s own introspection. These were the methods used for most dictionary production until the late 1980s. However, under the influence of the COBUILD initiative (Sinclair 1987), the field switched massively to the use of corpora, i.e. large collections of texts, either as supplemental data sources or as primary data sources. Even field linguists now make a much greater effort to collect textual artefacts (stories, songs, narratives, etc.) in their work with unknown languages. In a parallel manner, large collections of texts must represent the primary data source for the construction of ontologies and taxonomies for KM.
With the rise of corporate intranets, the increasing use of email to conduct a large proportion of business activity, and the continuous growth of textual databanks in all professions, it is clear that methods which use texts as their primary data source are the most likely to go at least some of the way towards constructing taxonomies and ‘capturing’ the knowledge required. Given the observations made above about the unwillingness of individuals to ‘add’ to a taxonomy or ‘mark up’ their own texts, and given the continuous change and expansion of information in all domains, using texts as the main source of data appears both efficient and inevitable. It is in this context that the focus of this paper will be on methods which can take as input collections of texts in some form or another.

3. Methodological Criteria

In this section we consider a number of criteria to be used when choosing methods which process texts and produce taxonomies or components of taxonomies as their output. Our purpose here is twofold. First, we wish to create a set of criteria to help guide the choice of appropriate tools for the automatic construction of taxonomies. While there are a large number of methods which might conceivably produce appropriate output, only a subset will actually fulfil these criteria. Secondly, we hope thereby to contribute to a means by which different approaches to constructing taxonomies can be evaluated, as there is a complete dearth of evaluative measures in this field. Writers on ontology evaluation concentrate on a limited number of criteria which are only appropriate to hand-crafted logical objects (Gomez-Perez 1999, Guarino & Welty 2000).